library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(xlsx)
olive_data <- read.csv("olive.csv")
olive_data <- olive_data[,-1]
olive_data$Region <- factor(olive_data$Region, levels = c(1,2,3))
ggplot(data = olive_data, aes(x = oleic, y = palmitic)) + geom_point(aes(colour = linoleic))
ggplot(data = olive_data, aes(x = oleic, y = palmitic)) + geom_point(aes(colour = cut_interval(linoleic, n = 4)))
Analysis: The second graph is easier to analyze than the first. The perception problem highlighted here is the limited channel capacity of human perception: we can distinguish more hues than intensity levels of a single colour, so four discrete classes are easier to tell apart than shades on a continuous scale.
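The discretization used in the second plot can be sketched on a small vector. This is only an illustration: `x` is a made-up toy vector, not a column of olive.csv, and it assumes ggplot2 is installed.

```r
library(ggplot2)

# Toy values (hypothetical), standing in for a continuous variable such as linoleic
x <- c(0, 1, 2, 3, 4, 5, 6, 7, 8)

# cut_interval() splits the range of x into n groups of equal width
classes <- cut_interval(x, n = 4)

# One class label per observation, four levels in total
table(classes)
nlevels(classes)
```

Mapping `classes` to colour gives ggplot2 a discrete scale with four distinct hues instead of a continuous gradient.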
ggplot(data = olive_data, aes(x = oleic, y = palmitic)) + geom_point(aes(colour = cut_interval(linolenic, n = 4)))
ggplot(data = olive_data, aes(x = oleic, y = palmitic)) + geom_point(aes(size = cut_interval(linolenic, n = 4)))
## Warning: Using size for a discrete variable is not advised.
ggplot(data = olive_data, aes(x = oleic, y = palmitic)) + geom_point() + geom_spoke(aes(angle = as.numeric(cut_interval(linolenic, n = 4))*10), radius = 50)
Analysis: The class boundaries were easiest to detect when the discretized linolenic was mapped to colour. This is because the channel capacity for detection follows the order colour > direction > size.
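Note that geom_spoke() reads its angle in radians, so multiplying the class index by 10 gives 10, 20, 30 and 40 radians rather than degrees. A possible variant, sketched on made-up data (`toy`, its columns and the chosen radius are all assumptions for illustration), spaces the four classes evenly over half a turn:

```r
library(ggplot2)

set.seed(1)
# Hypothetical toy data, not the olive data set
toy <- data.frame(x = runif(40), y = runif(40), z = runif(40))
toy$class <- cut_interval(toy$z, n = 4)

# Map class 1..4 to 0, pi/4, pi/2 and 3*pi/4 radians
toy$angle <- (as.numeric(toy$class) - 1) * pi / 4

# radius is in data units, so it is chosen to suit the 0..1 range of the toy axes
p_spoke <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_spoke(aes(angle = angle), radius = 0.05)
p_spoke
```

With evenly spaced angles the direction of a spoke increases with the class, which may make the ordering easier to read than arbitrary directions.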
ggplot(data = olive_data, aes(x = oleic, y = eicosenoic)) + geom_point(aes(colour = as.numeric(Region)))
ggplot(data = olive_data, aes(x = oleic, y = eicosenoic)) + geom_point(aes(colour = Region))
Analysis: Treating a factor as a number assumes that the regions are ordered and that adjacent regions differ by exactly one unit (e.g. Cat -> Dog -> Human, with one unit between each pair), while treating Region as a factor makes no such step assumption. The preattentive pattern emerges here because each region gets a distinct colour.
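The difference between the two encodings can be seen directly in base R. A minimal sketch on a hypothetical four-element vector (`region` is made up here, mirroring how Region was built above):

```r
# Hypothetical region labels, coded the same way as olive_data$Region
region <- factor(c(1, 2, 3, 1), levels = c(1, 2, 3))

# As a factor the values are unordered categories
is.numeric(region)
levels(region)

# as.numeric() returns the underlying level codes, which imposes an order
# and an equal step between regions
codes <- as.numeric(region)
codes
```

ggplot2 picks a continuous colour gradient for `codes` and a discrete palette for `region`, which is why the two plots look so different.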
ggplot(data = olive_data, aes(x = oleic, y = eicosenoic)) + geom_point(aes(colour = cut_interval(linoleic, n = 3), shape = cut_interval(palmitic, n = 3), size = cut_interval(palmitoleic, n = 3)))
## Warning: Using size for a discrete variable is not advised.
Analysis: It is very hard to distinguish between the 27 possible combinations (3 colours x 3 shapes x 3 sizes) because there is no clear boundary between the regions. The perception problem demonstrated here concerns the attentive mechanism: no individual feature of the chart (shape, colour, size) helps in distinguishing the boundary.
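The count of 27 comes from crossing three levels on each of the three channels. As a quick check (the column names below are labels chosen for illustration only):

```r
# Every combination of 3 colour classes, 3 shape classes and 3 size classes
combos <- expand.grid(colour = 1:3, shape = 1:3, size = 1:3)
nrow(combos)  # 27
```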
ggplot(data = olive_data, aes(x = oleic, y = eicosenoic)) + geom_point(aes(colour = Region, shape = cut_interval(palmitic, n = 3), size = cut_interval(palmitoleic, n = 3)))
## Warning: Using size for a discrete variable is not advised.
Analysis: Because colouring by Region establishes visually clear boundaries between the regions, the attentive mechanism has no problem scanning the individual features of the chart (shape, colour, size), which in turn helps in distinguishing the boundaries.
p <- olive_data %>%
  group_by(Region) %>%
  summarize(total_oils = sum(palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic)) %>%
  plot_ly(values = ~total_oils, type = 'pie', showlegend = FALSE) %>%
  layout(title = 'Total oils by region',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
Analysis: The plot does not conform to good visualization practice. Pie charts are generally not advised because angles are harder to judge than sizes, and the missing labels and legend add further overhead to processing the information.
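One possible alternative is a labelled bar chart, where the totals are compared by bar height instead of angle. The sketch below uses made-up totals (`toy_totals` is hypothetical and not computed from olive.csv) and only shows the shape of such a plot:

```r
library(ggplot2)

# Hypothetical totals, invented for illustration only
toy_totals <- data.frame(
  Region = factor(c(1, 2, 3)),
  total_oils = c(30, 10, 15)
)

# Bars encode the totals as lengths; axis labels replace the missing legend
p_bar <- ggplot(toy_totals, aes(x = Region, y = total_oils)) +
  geom_col() +
  labs(title = "Total oils by region", x = "Region", y = "Total oils")
p_bar
```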
ggplot(olive_data, aes(x = linoleic, y = eicosenoic)) + geom_point(aes(colour = Region)) + geom_density_2d()
ggplot(olive_data, aes(x = linoleic, y = eicosenoic)) + geom_point(aes(colour = Region))
Analysis: As the two plots show, the contour (density) plot suggests several clusters forming, while the plain scatter plot coloured by region suggests a simpler cluster structure.
baseball <- read.xlsx("baseball-2016.xlsx", sheetName = "Sheet1")
Analysis: Yes, it is reasonable to scale the data (perform dimensionality reduction), since visualizing more than 4 features is not advised.
distance <- dist(baseball, method = "minkowski")
## Warning in dist(baseball, method = "minkowski"): NAs introduced by coercion
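The warning above indicates that dist() tried to coerce non-numeric columns and produced NAs. A possible way to avoid this, sketched on a made-up data frame (`toy` and its column names are hypothetical, not the columns of baseball-2016.xlsx), is to keep only the numeric columns and scale them before computing distances:

```r
# Hypothetical stand-in for the baseball data: one text column, two numeric ones
toy <- data.frame(
  Team = c("A", "B", "C", "D"),
  Wins = c(90, 75, 60, 82),
  Runs = c(800, 650, 600, 720),
  stringsAsFactors = FALSE
)

# Keep numeric columns only and use the team names as row labels
num <- toy[, sapply(toy, is.numeric), drop = FALSE]
rownames(num) <- toy$Team

# Centre and scale each column so that no variable dominates the distance
scaled <- scale(num)

# Minkowski distance with p = 2, which is the Euclidean distance
d <- dist(scaled, method = "minkowski", p = 2)
d
```

The numbers reported below come from the original, unscaled call, so a configuration fitted on scaled data would not match them.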
fit <- isoMDS(distance, k = 2)
## initial value 12.061782
## final value 12.060974
## converged
fit
## $points
## [,1] [,2]
## [1,] 174.54938 131.124243
## [2,] -182.40140 -117.454473
## [3,] 124.90457 57.401223
## [4,] 488.11041 -97.048987
## [5,] 120.04189 72.909394
## [6,] -32.45939 -38.883384
## [7,] -90.57921 -27.682332
## [8,] 76.27979 -33.403319
## [9,] 353.96012 65.107149
## [10,] 133.47471 16.396883
## [11,] -15.00169 164.933200
## [12,] -87.21574 -122.073770
## [13,] -90.05219 -323.147974
## [14,] -47.81722 16.895979
## [15,] -97.72915 -148.374660
## [16,] -268.29449 257.463231
## [17,] 69.46076 133.417708
## [18,] -89.70618 10.940488
## [19,] -121.33746 -126.790799
## [20,] -164.22874 -181.456796
## [21,] -333.00331 35.372480
## [22,] -35.77541 12.615108
## [23,] -299.53261 187.207321
## [24,] 140.32094 4.802752
## [25,] -13.14084 -219.299090
## [26,] 173.05957 56.461511
## [27,] -75.33719 196.606369
## [28,] 110.05093 -60.506235
## [29,] 29.23160 98.696348
## [30,] 50.16755 -22.229568
##
## $stress
## [1] 12.06097
#plot of solution
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
main="Nonmetric MDS of baseball data", type="n")
text(x, y, labels = row.names(baseball), cex=.7)
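One way to judge how well a two-dimensional configuration preserves the original distances is a Shepard plot. The sketch below uses the built-in `swiss` data set instead of the baseball data, purely so that it runs on its own, and assumes MASS is installed:

```r
library(MASS)

# Built-in example data (not the baseball data); drop the first column
swiss_x <- as.matrix(swiss[, -1])
swiss_dist <- dist(swiss_x)

# Nonmetric MDS in two dimensions; trace = FALSE suppresses the iteration log
swiss_mds <- isoMDS(swiss_dist, k = 2, trace = FALSE)

# Shepard() pairs each original dissimilarity (x) with the fitted distance (y)
# and the monotone regression fit (yf)
sh <- Shepard(swiss_dist, swiss_mds$points)
plot(sh$x, sh$y, pch = ".", xlab = "Dissimilarity", ylab = "Distance")
lines(sh$x, sh$yf, type = "S")
```

Points lying close to the step function indicate that the ordering of the distances is well preserved. isoMDS() reports its stress in percent.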